Unemployment rate.
In this example, there is a correlation between the level of education and the unemployment rate: more education is associated with less unemployment. However, this does not necessarily mean that there is a cause-and-effect relationship (causality) between these variables. Unmeasured factors such as the economy, politics, or social conditions can influence both education and unemployment without there being a direct link between them.
- Data: OECD study (Organisation for Economic Co-operation and Development)
- Input: PISA score
- Output: unemployment rate
Chocolate consumption per capita.
In the following case, we observe a linear correlation between two variables, such as the number of Nobel Prizes in a country and chocolate consumption. It is unlikely that this implies a direct causal relationship between these variables. This situation illustrates the principle that correlation does not prove causation.
The Ozone Example (with R)
This dataset (Air Breizh, Summer 2001) contains 112 observations and 14 variables.
library(rmarkdown)
ozone1 <- read.table("ozone1.txt",header=TRUE, sep="", dec=",")
paged_table(ozone1)
Let's correct the incorrect declaration of variables:
💡 Good to Know!
With R, to change a column to date type, you first convert it to a factor (or character), then specify the format with as.Date.
Here, across is used with where(is.character) to target only the columns that are of character type (chr) and apply as.numeric to these columns.
library(dplyr)
ozone1$Date <- as.Date(as.factor(ozone1$Date),format = "%Y%m%d")
ozone1$vent<-as.factor(ozone1$vent)
ozone1$pluie<-as.factor(ozone1$pluie)
ozone1 <- ozone1 %>% mutate(across(where(is.character), as.numeric))
ozone1 <- ozone1 %>% mutate(across(where(is.integer), as.numeric))
str(ozone1)
## 'data.frame': 112 obs. of 14 variables:
## $ Date : Date, format: "2001-06-01" "2001-06-02" "2001-06-03" ...
## $ maxO3 : num 87 82 92 114 94 80 79 79 101 106 ...
## $ T9 : num 15.6 17 15.3 16.2 17.4 17.7 16.8 14.9 16.1 18.3 ...
## $ T12 : num 18.5 18.4 17.6 19.7 20.5 19.8 15.6 17.5 19.6 21.9 ...
## $ T15 : num 18.4 17.7 19.5 22.5 20.4 18.3 14.9 18.9 21.4 22.9 ...
## $ Ne9 : num 4 5 2 1 8 6 7 5 2 5 ...
## $ Ne12 : num 4 5 5 1 8 6 8 5 4 6 ...
## $ Ne15 : num 8 7 4 0 7 7 8 4 4 8 ...
## $ Vx9 : num 0.695 -4.33 2.954 0.985 -0.5 ...
## $ Vx12 : num -1.71 -4 1.879 0.347 -2.954 ...
## $ Vx15 : num -0.695 -3 0.521 -0.174 -4.33 ...
## $ maxO3v: num 84 87 82 92 114 94 80 99 79 101 ...
## $ vent : Factor w/ 4 levels "Est","Nord","Ouest",..: 2 2 1 2 3 3 3 2 2 3 ...
## $ pluie : Factor w/ 2 levels "Pluie","Sec": 2 2 2 2 2 1 2 2 2 2 ...
Objective. Predict the maximum ozone concentration in the air (maxO3). Initially, let's focus solely on the temperatures recorded at 12 PM (T12). Find a function \(f\) such that maxO3 \(\approx f\)(T12), with \(X\) = T12, the temperature recorded at 12 PM, and \(Y\) = maxO3, the maximum ozone concentration in the air.
The function \(f\) is called the regression function!
Question: To which family of functions does \(f\) belong?
Find the function that minimizes the sum of the squared distances from the function to the points \((\text{T12}_i, \text{maxO3}_i)\).
\[\widehat{f}\in\arg\min_{f} \sum_{i=1}^n(\text{maxO3}_i-f(\text{T12}_i))^2\]
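To make the criterion concrete, here is a minimal R sketch on simulated data (the variable names x and y are stand-ins for T12 and maxO3, and the simulated coefficients are arbitrary): we minimize the sum of squared distances numerically and observe that it lands on the same solution as lm().

```r
# Minimal sketch on simulated data: minimize the least-squares criterion
# numerically and compare with the fit returned by lm().
set.seed(1)
n <- 100
x <- runif(n, 10, 30)                    # stand-in for T12
y <- -27 + 5.5 * x + rnorm(n, sd = 10)   # stand-in for maxO3

# Sum of squared distances for an affine candidate f(x) = b0 + b1 * x
rss <- function(b) sum((y - b[1] - b[2] * x)^2)

fit <- optim(c(0, 0), rss, method = "BFGS")  # numerical minimization
fit$par                                      # close to coef(lm(y ~ x))
```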
Let's consider the simple linear model, such that maxO3 \(\approx\) affine function of T12, i.e. \(y \approx f(x) \approx \beta_0 + \beta_1 x\), where \(x\) = T12 and \(y\) = maxO3.
Let \(i=1,\cdots, n\), with \(n\) the number of observations \((\text{maxO3}_i,\text{T12}_i)\); we define the simple linear model as follows \[ y_i=\beta_0+\beta_1x_i+\epsilon_i, \] where
- \(x_i\) = T12 is called the explanatory variable, regressor, predictor, covariate, etc.
- \(y_i\) = maxO3 is called the response variable, variable to explain, target, label, etc.
- \(\epsilon_i\) is called the error or theoretical residual,
- \(\beta_0\) is called the intercept,
- \(\beta_1\) is called the slope or weight of the variable \(x\).
Matrix form. Let us define \(\mathbf{Y}=(Y_1\ \cdots\ Y_n)^\top\), \(\mathbb{X}=(\boldsymbol{1}_n,\mathbf{X})\) and \(\boldsymbol{\beta}=(\beta_0,\beta_1)^\top\).
Problem. \(\boldsymbol{\beta}=(\beta_0, \beta_1)^\top\) is unknown!
Solution. Estimate it from our sample of \(n\) observations.
Modeling involves assuming that the \(y_i\) are realizations of random variables \(Y_i\) whose distribution we specify.
The simple linear gaussian model. Let \((Y_i,x_i)_{i=1,\cdots,n}\); we define the simple linear gaussian model as follows \[ Y_i=\beta_0+\beta_1x_i+\epsilon_i,\quad i=1,\cdots,n \] where Assumptions [P1]–[P4] (detailed in the comments below) are satisfied.
These assumptions can be summarized as \(\epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2)\).
The simple linear gaussian model (matrix form).
\[\mathbf{Y}=\beta_0\boldsymbol{1}_n+\mathbf{X}\beta_1+\boldsymbol{\epsilon}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\] where Assumptions [P1]–[P4] are satisfied.
These assumptions can be summarized as \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2\mathbf{I}_n)\).
Prediction with the estimated model. \[\widehat{Y}_i = \widehat{f}\big(x_i\big) = \widehat{\beta}_0 + \widehat{\beta}_1 x_i\]
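As a quick sanity check of the prediction formula (using the built-in cars dataset as a stand-in for the ozone data, with dist ~ speed playing the role of maxO3 ~ T12), the fitted values returned by R coincide with \(\widehat{\beta}_0 + \widehat{\beta}_1 x_i\) computed by hand:

```r
# Prediction with the estimated simple linear model, checked by hand
# on the built-in `cars` data.
fit <- lm(dist ~ speed, data = cars)

# \hat{Y}_i = \hat{beta}_0 + \hat{beta}_1 x_i computed manually
yhat_manual <- coef(fit)[1] + coef(fit)[2] * cars$speed

all.equal(unname(yhat_manual), unname(fitted(fit)))  # TRUE
```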
COMMENTS
📝 Assumption [P1]. In practice, this means that the model is correct and that we have not forgotten a relevant term: the model is linear.
📝 Assumption [P2]. We speak of a homoscedastic model, as opposed to a heteroscedastic model where the error term would not have the same variance for all observations.
📝 Assumption [P3]. Thus, the observations are assumed to be uncorrelated, i.e. independent sampling or the results of a physical experiment conducted under independent conditions. Problems can arise when time plays a role in the phenomenon.
📝 Assumption [P4]. This postulate is the least important, as we can do without it when the number of observations is large (more than 20 or 30 observations). It is difficult to detect non-gaussianity of the errors. We will see that R proposes graphical tools to try to validate this postulate.
Ordinary least squares (OLS) estimator
To find \((\widehat{\beta}_0, \widehat{\beta}_1)^\top\), we minimize the sum of squared errors from the model \[Y_i = f(x_i) + \epsilon_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad i = 1, \cdots, n\] That is, \((\widehat{\beta}_0, \widehat{\beta}_1)\) are the arguments that minimize the sum of squared residuals
\[ \begin{aligned} \widehat {\boldsymbol{\beta}}=(\widehat{\beta}_0, \widehat{\beta}_1) &\in \underset{(\beta_0, \beta_1)}{\arg\min} \sum_{i=1}^n (Y_i - f(x_i))^2\\ &= \underset{(\beta_0, \beta_1)}{\arg\min} \sum_{i=1}^n (Y_i - \beta_0 - \beta_1 x_i)^2=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \|\mathbf{Y}-\mathbb{X}\boldsymbol{\beta}\|^2\\ &=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \sum_{i=1}^n \epsilon_i^2=\underset{\boldsymbol{\beta}\in\mathbb{R}^2}{\arg\min} \|\boldsymbol{\epsilon}\|^2 \end{aligned} \]
This estimator is called the least squares estimator, which is also the maximum likelihood estimator in this context.
Let's plot this line (here with R); it is called the least squares line.
library(ggplot2)
library(plotly)
p1<-ggplot(ozone1) + aes(x = T12, y = maxO3) + geom_point(col = 'red', size = 0.5) + geom_smooth(method = "lm", se = FALSE)
ggplotly(p1)
With R, we can compute \(\widehat {\boldsymbol{\beta}}=(\widehat{\beta}_0, \widehat{\beta}_1)\)
library(kableExtra)
mod <- lm(maxO3 ~ T12, data = ozone1)
mod$coefficients %>% kbl(col.names = c("Estimation"))

|             | Estimation |
|-------------|------------|
| (Intercept) | -27.419636 |
| T12         | 5.468685   |
The Ozone dataset actually contains much more information. In addition to the response variable maxO3 and the covariate \(X_1 :=\)T9, there are \(X_2 :=\)T12, \(X_3:=\)T15, \(X_4 :=\)Ne9, \(X_5:=\)Ne12, \(X_6 :=\)Ne15, \(X_7 :=\)Vx9, \(X_8:=\)Vx12, and \(X_9:=\)Vx15.
Define by \(\mathbf{y} = \left(y_1\ \cdots \ y_n\right)^\top\) and for \(p=9\) the design matrix is such that \[ \mathbb{X}=(\boldsymbol{1}_n,\mathbf{X}_1,\cdots,\mathbf{X}_p)=\begin{pmatrix} 1 & x_{11} & \cdots & x_{1p}\\ \vdots & \vdots&\vdots&\vdots\\ 1 & x_{i1} & \cdots & x_{ip}\\ \vdots & \vdots&\vdots&\vdots\\ 1 & x_{n1} & \cdots & x_{np} \\ \end{pmatrix}=\begin{pmatrix} \mathbf{x}^\top_1 \\ \vdots\\ \mathbf{x}^\top_n \end{pmatrix}\in\mathbb{R}^{n\times (p+1)} \]
The multivariate linear gaussian regression model. Let \((Y_i,\mathbf{x}_i)_{i=1,\cdots,n}\); we define the multivariate linear gaussian model as follows \[ Y_i=\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip}+\epsilon_i,\quad i=1,\cdots,n \]
The multivariate linear gaussian regression model (matrix form).
\[\mathbf{Y}=\beta_0\boldsymbol{1}_n+\mathbf{X}_1\beta_1+\cdots+\mathbf{X}_p\beta_p+\boldsymbol{\epsilon}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\] where Assumptions [P1]–[P4] are satisfied.
These assumptions can be summarized as \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2\mathbf{I}_n)\).
Prediction with the estimated model. \[\widehat{Y}_i = \widehat{f}\big(x_i\big) = \widehat{\beta}_0 + \widehat{\beta}_1 x_{i1}+\cdots+\widehat{\beta}_p x_{ip}\quad \text{and}\quad\widehat{\mathbf{Y}}=\mathbb{X}\widehat {\boldsymbol{\beta}}\]
COMMENTS
📝 We could use only the original covariates or create new ones by transforming the existing ones (taking the square or the logarithm, etc.). This is called feature engineering.
📝 Thus, a linear model does not mean that the relationship between the predictors and the response variable is linear, but that the model is linear in the \(\beta_j\) parameters.
Ordinary least squares (OLS) estimator
The OLS estimator is such that \[ \widehat {\boldsymbol{\beta}} \in \underset{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\arg\min} \|\mathbf{Y}-\mathbb{X}\boldsymbol{\beta}\|^2= \underset{\boldsymbol{\beta}\in\mathbb{R}^{p+1}}{\arg\min}\sum_{i=1}^n \big(Y_i -(\beta_0+ \sum_{j=1}^px_{ij}\beta_j)\big)^2 \]
📝 Note that for this course, we assume \((p+1)\leq n\).
Rank assumption. The matrix \(\mathbb{X}\) is full rank, \(\text{rank}(\mathbb{X}) = p+1\).
COMMENTS
📝 The model is identifiable if for any \(\boldsymbol{\beta}\) and \(\boldsymbol{\beta}'\) in \(\mathbb{R}^{p+1}\) such that \(\mathbb{X}\boldsymbol{\beta}=\mathbb{X}\boldsymbol{\beta}'\), we have \(\boldsymbol{\beta}=\boldsymbol{\beta}'\). This guarantees the uniqueness of the regression vector. This is a very important hypothesis, and it holds as soon as the linear map \(\boldsymbol{\beta}\mapsto\mathbb{X}\boldsymbol{\beta}\) is injective, which is equivalent to \(\text{rank}(\mathbb{X}) = p+1\).
📝 This hypothesis is crucial if we want to study the importance of the covariates, but it is not necessary for the prediction problem: predicting future values of \(Y\) based on future observations of the predictors.
[Proposition 1]:
Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)). Let \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), where \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\) is a linear model. Then the least squares estimator is unique \[ \widehat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta} \in \mathbb{R}^{p+1}} \|\mathbf{Y} - \mathbb{X} \boldsymbol{\beta}\|^2 \] and can be written as \[ \widehat{\boldsymbol{\beta}} = \widehat{\boldsymbol{\beta}}(\mathbf{Y}) = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} \]
Proof. Will be done in the tutorial.\(\square\)
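The closed-form expression of Proposition 1 can be checked numerically. Here is a sketch on simulated data (arbitrary \(n\), \(p\), and coefficients) comparing \((\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}\), computed by hand, with the coefficients returned by lm():

```r
# Check of the closed-form OLS solution on simulated data.
set.seed(2)
n <- 50; p <- 3
X <- cbind(1, matrix(rnorm(n * p), n, p))   # design matrix (intercept included)
beta <- c(1, 2, -1, 0.5)
Y <- as.vector(X %*% beta + rnorm(n))

beta_hat <- solve(t(X) %*% X, t(X) %*% Y)   # (X'X)^{-1} X'Y
fit <- lm(Y ~ X - 1)                        # "-1": the intercept is already in X

max(abs(beta_hat - coef(fit)))              # numerically zero
```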
COMMENTS
📝 Let \(\mathbb{X}\) be the \(n\times (p+1)\) design matrix and denote by \([X]:=\mathcal{I}m(\mathbb{X})\) the space generated by the \(p+1\) columns of \(\mathbb{X}\).
📝 Denote by \(P_X\) the orthogonal projection onto \([X]\). Then \(P_{X^{\perp}}=\mathbf{I}_n-P_X\) is the orthogonal projection matrix onto \([X]^{\perp}\), the orthogonal complement of \([X]\). Since \[\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}=\mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}=P_X\,\mathbf{Y},\] the fitted value \(\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}\) is the orthogonal projection of \(\mathbf{Y}\) onto \([X]\).
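A small R sketch (simulated data, arbitrary sizes) verifying the projector properties: \(P_X=\mathbb{X}(\mathbb{X}^\top\mathbb{X})^{-1}\mathbb{X}^\top\) is symmetric and idempotent, and the residual vector \(\mathbf{Y}-P_X\mathbf{Y}\) is orthogonal to the columns of \(\mathbb{X}\).

```r
# P_X = X (X'X)^{-1} X' is the orthogonal projector onto [X].
set.seed(3)
n <- 30
X <- cbind(1, rnorm(n), rnorm(n))
Y <- rnorm(n)

PX <- X %*% solve(t(X) %*% X) %*% t(X)

all.equal(PX %*% PX, PX)        # idempotent: P_X^2 = P_X
all.equal(t(PX), PX)            # symmetric
yhat <- PX %*% Y                # orthogonal projection of Y onto [X]
max(abs(t(X) %*% (Y - yhat)))   # residual orthogonal to the columns of X
```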
📝 Note that in the simplest model, reduced to the intercept \(\mathbf{Y} =\beta\boldsymbol{1}_n + \boldsymbol{\epsilon}\), the least squares estimator of the real parameter \(\beta\) is \(\widehat{\beta}=\overline{\mathbf{Y}}\).
Let's revisit the ozone example and, for this chapter, select only the numerical regressors with R
## [1] "Date" "maxO3" "T9" "T12" "T15" "Ne9" "Ne12" "Ne15" "Vx9" "Vx12"
## [11] "Vx15" "maxO3v" "vent" "pluie"
## [1] "maxO3" "T9" "T12" "T15" "Ne9" "Ne12" "Ne15" "Vx9" "Vx12" "Vx15"
## [11] "maxO3v"
Under R, we can declare the linear model in two ways: either by naming the variables or by using a `.` (which stands for all the remaining columns).
Now, let’s display the estimated values \(\widehat{\boldsymbol{\beta}}\) of \(\boldsymbol{\beta}\).
## (Intercept) T9 T12 T15 Ne9 Ne12 Ne15 Vx9
## 12.24441987 -0.01901425 2.22115189 0.55853087 -2.18909215 -0.42101517 0.18373097 0.94790917
## Vx12 Vx15 maxO3v
## 0.03119824 0.41859252 0.35197646
[Proposition 2]:
Let us consider the linear regression model \(\mathbf{Y} = \mathbb{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\). Under the Rank assumption and [P1], the LSE \(\widehat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\): \[\forall\,\boldsymbol{\beta}\in\mathbb R^{p+1},\quad \text{E}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]=\boldsymbol{\beta}.\]
Proof. As \(\mathbb{X}\) is a deterministic matrix and \(\mathbf{Y} = \mathbb{X}\boldsymbol{\beta}+ \boldsymbol{\epsilon}\), it follows that \[\begin{eqnarray*} \text{E}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]&=&\text{E}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}] =\text{E}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top (\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon})]\\ &=&(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbb{X}\boldsymbol{\beta}+(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{\beta}, \end{eqnarray*}\] as \((\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbb{X}=\mathbf{I}_{p+1}\) and \(\text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{0}_n\). \(\square\)
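Unbiasedness can also be illustrated by simulation: averaging \(\widehat{\boldsymbol{\beta}}\) over many samples drawn from the model recovers \(\boldsymbol{\beta}\). A sketch with arbitrary simulated values:

```r
# Monte Carlo illustration of E[beta_hat] = beta.
set.seed(4)
n <- 40
x <- runif(n)
X <- cbind(1, x)                 # fixed design across replications
beta <- c(2, -3)

est <- replicate(2000, {
  Y <- X %*% beta + rnorm(n)     # fresh errors at each replication
  solve(t(X) %*% X, t(X) %*% Y)  # OLS estimate
})

rowMeans(est)                    # close to (2, -3)
```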
[Proposition 3]: Let \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\), with \(\boldsymbol{\beta}\in\mathbb{R}^{p+1}\) and \(\mathbb{X}\) the \(n\times (p+1)\) full rank design matrix. Then \(\widehat {\boldsymbol{\beta}}=\widehat {\boldsymbol{\beta}}(\mathbf{Y})=(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y}\) is such that, under [P1] (\(\text{E}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\boldsymbol{0}_n\)) and [P2]-[P3] (\(\text{Var}_{\boldsymbol{\beta}}[\boldsymbol{\epsilon}]=\sigma^2\mathbf{I}_n\)), \[ \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}]=\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1} \]
Proof. \[\begin{eqnarray*} \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}] &=&\text{Var}_{\boldsymbol{\beta}}[(\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y}]\\ & = & (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \text{Var}_{\boldsymbol{\beta}}(\mathbf{Y}) ((\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top )^\top \\ &=& (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \sigma^2 \mathbf{I}_n \mathbb{X} ((\mathbb{X}^\top \mathbb{X})^{-1})^\top\quad \mbox{ as } \text{Var}_{\boldsymbol{\beta}}(\mathbf{Y}) = \sigma^2 \mathbf{I}_n \\ &=& \sigma^2 (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1}\quad \mbox{ as } (\mathbb{X}^\top \mathbb{X})^{-1} \mbox{ is symmetric } \\ &=& \sigma^2 (\mathbb{X}^\top \mathbb{X})^{-1}\quad\square \end{eqnarray*}\]
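Proposition 3 can be checked against R's vcov(): with the estimate \(\widehat\sigma^2\) plugged in for \(\sigma^2\), the matrix \(\widehat\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1}\) is exactly the estimated covariance matrix of \(\widehat{\boldsymbol{\beta}}\) reported by R (here on the built-in cars data as a stand-in):

```r
# sigma^2 (X'X)^{-1}, with sigma^2 estimated, matches vcov(fit).
fit <- lm(dist ~ speed, data = cars)
X   <- model.matrix(fit)       # design matrix (intercept + speed)
s2  <- summary(fit)$sigma^2    # estimate of sigma^2

V <- s2 * solve(t(X) %*% X)
all.equal(V, vcov(fit), check.attributes = FALSE)  # TRUE
```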
[Theorem 1: (Gauss-Markov theorem)]:
Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)). Let \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\), be a linear model. Assuming [P1]-[P2]-[P3] are satisfied, the ordinary least squares estimator \[ \widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} \] is optimal in the class \(\mathcal{E}\) of unbiased linear estimators (i.e., linear combinations of the observations), in the sense that \[ \lambda^\top \left(\text{Var}_{\boldsymbol{\beta}}[\widetilde{\boldsymbol{\beta}}] - \text{Var}_{\boldsymbol{\beta}}[\widehat{\boldsymbol{\beta}}] \right) \lambda \geq 0, \quad \forall \widetilde{\boldsymbol{\beta}} \in \mathcal{E}, \quad \forall \lambda \in \mathbb{R}^{p+1}. \]
Proof. Let \(C\) be a matrix of size \(r \times n\) (\(r = p+1\)) and let \(\widetilde{\boldsymbol{\beta}} = C \mathbf{Y} = C \mathbb{X} \boldsymbol{\beta} + C \boldsymbol{\epsilon}\) be an unbiased linear estimator. We have, for all \(\boldsymbol{\beta}\): \[\begin{eqnarray*} &&\text{E}_{\boldsymbol{\beta}}[C \mathbf{Y}] - \boldsymbol{\beta} = \boldsymbol{0}_r \Leftrightarrow C \mathbb{X} \boldsymbol{\beta} - \boldsymbol{\beta} = \boldsymbol{0}_r \\ &&\Leftrightarrow (C \mathbb{X} - \mathbf{I}_r) \boldsymbol{\beta} = \boldsymbol{0}_r \Leftrightarrow C \mathbb{X} = \mathbf{I}_r \, \text{ and } \, \mathbb{X}^\top C^\top = \mathbf{I}_r \end{eqnarray*}\] Thus, for \(P_X\) and \(P_{X^\perp}\) the orthogonal projectors onto \([X]\) and \([X]^\perp\), and using \(C \mathbb{X} = \mathbf{I}_r\) (so that \((\mathbb{X}^\top \mathbb{X})^{-1} = C P_X C^\top\)), we have: \[\begin{eqnarray*} &&\text{Var}_{\boldsymbol{\beta}}(\widetilde{\boldsymbol{\beta}}) - \text{Var}_{\boldsymbol{\beta}}(\widehat{\boldsymbol{\beta}}) \\ &=& \sigma^2 (CC^\top - (\mathbb{X}^\top \mathbb{X})^{-1}) = \sigma^2 C (\mathbf{I}_n - \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top) C^\top \\ &=& \sigma^2 C (\mathbf{I}_n - P_X) C^\top = \sigma^2 C P_{X^\perp} C^\top = \sigma^2 C P_{X^\perp} P_{X^\perp}^\top C^\top \end{eqnarray*}\] since \(P_{X^\perp} = P_{X^\perp}^2 = P_{X^\perp} P_{X^\perp}^\top\) (properties of the projector). Therefore, for all \(\lambda \in \mathbb{R}^{p+1}\): \[ \lambda^\top (\text{Var}_{\boldsymbol{\beta}}(\widetilde{\boldsymbol{\beta}}) - \text{Var}_{\boldsymbol{\beta}}(\widehat{\boldsymbol{\beta}})) \lambda = \sigma^2 \lambda^\top C P_{X^\perp} P_{X^\perp}^\top C^\top \lambda = \sigma^2 \|P_{X^\perp}^\top C^\top \lambda\|^2 \geq 0. \qquad \square \]
📝 In the context of the Gauss-Markov theorem, the Ordinary Least Squares Estimator (OLSE) is considered the Best Linear Unbiased Estimator (BLUE).
[Proposition 4]:
Let \(\mathbb{X}\) be a design matrix of size \(n \times (p+1)\) with full rank (\(\text{rank}(\mathbb{X}) = p+1\)). Let \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), with \(\boldsymbol{\beta} \in \mathbb{R}^{p+1}\) being a linear model. Assuming [P1]-[P2]-[P3] are satisfied, then
\[\widehat{\boldsymbol{\beta}}\sim \mathcal{N}(\boldsymbol{\beta},\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1})\]
Proof. As \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\), the estimator \(\widehat{\boldsymbol{\beta}}=(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top\mathbf{Y}\) is a linear transformation of the gaussian vector \(\boldsymbol{\epsilon}\), hence gaussian; we conclude by Propositions 2 and 3. \(\square\)
Residuals (or estimated residuals). We define the residuals as follows: \[\widehat{\boldsymbol{\epsilon}}=\mathbf{Y}-\widehat{\mathbf{Y}}=(\mathbf{I}_n-P_X)\mathbf{Y}=P_{X^\perp}\mathbf{Y}\] where \(\widehat{\mathbf{Y}}\) is called the vector of fitted values \[\widehat{\mathbf{Y}}=\mathbb{X}\widehat{\boldsymbol{\beta}}=\mathbb{X}(\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y}=P_X\,\mathbf{Y}\]
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -53.566 -8.727 -0.403 0.000 7.599 39.458
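The defining properties of the residuals are easy to verify in R (built-in cars data as a stand-in): \(\widehat{\boldsymbol{\epsilon}}=\mathbf{Y}-\widehat{\mathbf{Y}}\), and since \(\widehat{\boldsymbol{\epsilon}}=P_{X^\perp}\mathbf{Y}\), it is orthogonal to the columns of \(\mathbb{X}\); in particular it sums to zero when the model has an intercept.

```r
# Residuals: Y - Y_hat, orthogonal to the columns of the design matrix.
fit <- lm(dist ~ speed, data = cars)
e   <- residuals(fit)

all.equal(e, cars$dist - fitted(fit))  # e = Y - Y_hat
sum(e)                                 # ~ 0 (orthogonal to the intercept column)
sum(e * cars$speed)                    # ~ 0 (orthogonal to the regressor)
```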
[Theorem 2]: We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]–[P4] (i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\)).
Let \(\widehat{\mathbf{Y}} = \mathbb{X}\widehat{\boldsymbol{\beta}}(\mathbf{Y}) = P_X\,\mathbf{Y}\) and \(\widehat{\boldsymbol{\epsilon}} = \mathbf{Y} - \widehat{\mathbf{Y}} = P_{X^\perp}\mathbf{Y}\). Then:
1. \(\widehat{\mathbf{Y}} \sim \mathcal{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 P_X)\) and \(\widehat{\boldsymbol{\epsilon}} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 P_{X^\perp})\),
2. \(\widehat{\mathbf{Y}}\) is independent of \(\widehat{\boldsymbol{\epsilon}}\),
3. \(\frac{\|\widehat{\mathbf{Y}} - \mathbb{X}\boldsymbol{\beta} \|^2}{\sigma^2}\) is independent of \(\frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{\sigma^2}\), with \(\frac{\|\widehat{\mathbf{Y}} - \mathbb{X}\boldsymbol{\beta} \|^2}{\sigma^2} \sim \chi^2_r\) and \(\frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{\sigma^2} \sim \chi^2_{n-r}\),
4. \(\widehat{\boldsymbol{\epsilon}}\) is independent of \(\widehat{\boldsymbol{\beta}}\).
Proof.
Points 1, 2 and 3. Note that \(\mathbf{Y} \sim \mathcal{N}(\mathbb{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n)\). Moreover \(P_X \mathbf{Y} = \widehat{\mathbf{Y}}\) and \(P_{X^\perp} \mathbf{Y} = \widehat{\boldsymbol{\epsilon}}\); therefore we conclude by Cochran's theorem applied with \(F_1 = [X]\) and \(F_2 = [X]^\perp\).
Point 4. As \[ (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top P_X \mathbf{Y} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbb{X} (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top \mathbf{Y} = \widehat{\boldsymbol{\beta}}, \] \(\widehat{\boldsymbol{\beta}}\) is a function of \(P_X \mathbf{Y}\), and since \(P_X \mathbf{Y}\) is independent of \(P_{X^\perp} \mathbf{Y}=\widehat{\boldsymbol{\epsilon}}\), the result follows. \(\square\)
[Proposition 5]:
We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]–[P4] (i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\)). An unbiased estimator of \(\sigma^2\) is \(\widehat{\sigma}^2\):
\[ \widehat{\sigma}^2 = \frac{\|\mathbf{Y} - \mathbb{X} \widehat{\boldsymbol{\beta}} \|^2}{n-r} = \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{n-r}. \]
Proof.
\[ \widehat{\sigma}^2 = \frac{\|\mathbf{Y} - \mathbb{X} \widehat{\boldsymbol{\beta}} \|^2}{n-r} = \frac{\|\mathbf{Y} - P_X \mathbf{Y} \|^2}{n-r} = \frac{\|\widehat{\boldsymbol{\epsilon}} \|^2}{n-r} = \frac{\|P_{X^\perp} \mathbf{Y} \|^2}{n-r}. \]
Since \(\sigma^{-1} \boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \mathbf{I}_n)\), by Cochran’s theorem, we have
\[ \|P_{X^\perp} (\sigma^{-1} \boldsymbol{\epsilon}) \|^2 \sim \chi^2_{n-r}. \]
Thus, since \(\text{E}[\chi^2_{n-r}] = n-r\) and
\[ \text{E}_{\boldsymbol{\beta}}[\|\widehat{\boldsymbol{\epsilon}} \|^2] = \text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp} \boldsymbol{\epsilon} \|^2] = \sigma^2 \text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp} (\sigma^{-1} \boldsymbol{\epsilon}) \|^2] = \sigma^2 (n-r), \] we get \(\text{E}_{\boldsymbol{\beta}}[\widehat{\sigma}^2]=\sigma^2\). \(\square\)
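This unbiased estimator \(\widehat{\sigma}^2=\|\widehat{\boldsymbol{\epsilon}}\|^2/(n-r)\) is exactly what summary.lm() reports (squared) as the residual standard error; a quick check on the built-in cars data as a stand-in:

```r
# sigma_hat^2 = ||e||^2 / (n - r) is what summary() reports, squared.
fit <- lm(dist ~ speed, data = cars)
n <- nrow(cars)
r <- length(coef(fit))                   # r = p + 1

s2_hat <- sum(residuals(fit)^2) / (n - r)
all.equal(s2_hat, summary(fit)$sigma^2)  # TRUE
```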
[Theorem 3]:
We consider the linear regression model \(\mathbf{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}\), under the rank assumption \(\text{rank}(\mathbb{X}) = p+1 = r\) and [P1]–[P4] (i.e., \(\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{0}_n, \sigma^2 \mathbf{I}_n)\)). Then \[ \frac{(n-r)\,\widehat{\sigma}^2}{\sigma^2}=\frac{\|\widehat{\boldsymbol{\epsilon}}\|^2}{\sigma^2}\sim \chi^2_{n-r}, \] and \(\widehat{\sigma}^2\) is independent of \(\widehat{\boldsymbol{\beta}}\).
Proof.
First, \(\sigma^{-1}\boldsymbol{\epsilon}\sim\mathcal{N}(\boldsymbol{0}_n,\mathbf{I}_n)\) and \(P_{X^\perp}\) is an orthogonal projector of rank \(n-r\). Then by Cochran's theorem \[ \|P_{X^\perp}(\sigma^{-1}\boldsymbol{\epsilon})\|^2\sim \chi^2_{n-r}. \] Moreover \[\text{E}_{\boldsymbol{\beta}}[\|\widehat{\boldsymbol{\epsilon}}\|^2]=\text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp}\boldsymbol{\epsilon}\|^2] =\sigma^2\text{E}_{\boldsymbol{\beta}}[\|P_{X^\perp}(\sigma^{-1}\boldsymbol{\epsilon})\|^2] = \sigma^2\,(n-r).\]
Finally, by Cochran's Theorem, \(P_{X^\perp} \mathbf{Y}\) is independent of \(P_X \mathbf{Y}\). Therefore, as \(\widehat{\sigma}^2 = \frac{\|P_{X^\perp} \mathbf{Y} \|^2}{n-r}\) and \(\widehat{\boldsymbol{\beta}} = (\mathbb{X}^\top \mathbb{X})^{-1} \mathbb{X}^\top P_X \mathbf{Y}\), we conclude that \(\widehat{\sigma}^2\) is independent of \(\widehat{\boldsymbol{\beta}}\). \(\square\)
We can display it with R
cat("Residual standard error:", summary(mod)$sigma, "on", summary(mod)$df[2], "degrees of freedom\n")
## Residual standard error: 14.36002 on 101 degrees of freedom
📈 Here, Residual standard error is the value of \(\widehat\sigma\), so \(\widehat{\sigma}^2=14.36002^2\), and 101 degrees of freedom is the number of degrees of freedom of the \(\chi^2_{n-r}\) distribution.
[Proposition 6]:
The maximum likelihood estimator of \((\boldsymbol{\beta},\sigma^2)\) is \((\widehat{\boldsymbol{\beta}}_{ML},\widehat\sigma_{ML}^2)\) where \[ \widehat{\boldsymbol{\beta}}_{ML} = (\mathbb{X}^\top \mathbb{X})^{-1}\mathbb{X}^\top \mathbf{Y},\quad \widehat\sigma_{ML}^2=\frac{\|P_{X^\perp}\mathbf{Y}\|^2}{n} \]
COMMENTS
📝 In this gaussian setting, the maximum likelihood estimator (MLE) \(\widehat\beta\) is the LSE.
📝 Note that the MLE of \(\sigma^2\) is a biased estimator and equals to \[\widehat\sigma_{ML}^2=\frac{\|\mathbf{Y}-\mathbb{X}\widehat{\boldsymbol{\beta}}\|^2}{n}=\frac{\|\widehat{\boldsymbol{\epsilon}}\|^2}{n}.\] To be compared with the unbiased estimator: \[ \widehat\sigma^2=\frac{\|\mathbf{Y}-\mathbb{X}\widehat{\boldsymbol{\beta}}\|^2}{n-r}=\frac{\|\widehat{\boldsymbol{\epsilon}}\|^2}{n-r}. \]
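The bias of \(\widehat\sigma_{ML}^2\) is visible by simulation: since \(\text{E}[\|\widehat{\boldsymbol{\epsilon}}\|^2]=\sigma^2(n-r)\), its average over repeated samples is \(\sigma^2(n-r)/n<\sigma^2\), while the corrected estimator is on target. A sketch with arbitrary simulated values:

```r
# Monte Carlo: the ML variance estimator (divide by n) is biased downward,
# the corrected one (divide by n - r) is unbiased.
set.seed(5)
n <- 20; sigma2 <- 4
x <- runif(n)
r <- 2                                   # intercept + slope
beta <- c(1, 2)

rss <- replicate(5000, {
  Y <- as.vector(cbind(1, x) %*% beta + rnorm(n, sd = sqrt(sigma2)))
  sum(residuals(lm(Y ~ x))^2)            # ||e||^2 for a fresh sample
})

c(ML = mean(rss / n), unbiased = mean(rss / (n - r)))
# ML is close to sigma2 * (n - r) / n = 3.6, unbiased is close to 4
```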
[Theorem 4]:
Let us consider the linear regression model \(\mathbf{Y}=\mathbb{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}\). Assume \(\mathbb{X}\) is full rank (\(r=p+1\)) and [P1]–[P4] hold. We have:
1. \(\forall c\in\mathbb{R}^r\), we have \[ \frac{c^\top \widehat{\boldsymbol{\beta}}-c^\top{\boldsymbol{\beta}}}{\widehat\sigma\sqrt{c^\top (\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim t_{n-r}. \]
2. \(\forall C \in \mathbb{R}^{q\times r}\) of full rank \(q\) (\(q\leq r\)), we have \[ \frac{(C\widehat{\boldsymbol{\beta}}-C\boldsymbol{\beta})^\top (C(\mathbb{X}^\top \mathbb{X})^{-1}C^\top)^{-1}(C\widehat{\boldsymbol{\beta}}-C\boldsymbol{\beta})}{q\;\widehat\sigma^2}\sim{\text{F}}_{(q,n-r)}. \]
Proof.
By Proposition 4, \(\widehat{\boldsymbol{\beta}}\sim\mathcal{N}(\boldsymbol{\beta},\sigma^2(\mathbb{X}^\top \mathbb{X})^{-1})\), so \(c^\top\widehat{\boldsymbol{\beta}}\sim\mathcal{N}(c^\top\boldsymbol{\beta},\sigma^2c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c)\) as a linear transformation of the previous gaussian vector. Therefore \[\left\{ \begin{array}{l} \eta:=\frac{c^\top\widehat{\boldsymbol{\beta}}-c^\top\boldsymbol{\beta}}{\sqrt{\sigma^2c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim\mathcal{N}(0,1)\\ \mathcal{X}:=\frac{(n-r)\widehat\sigma^2}{\sigma^2}\sim\chi^2_{n-r}\quad\text{by Theorem 3}\\ \eta\text{ is independent of }\mathcal{X}\quad\text{by Theorem 3} \end{array}\right. \] Therefore \[ \frac{\eta}{\sqrt{\mathcal{X}/(n-r)}}=\frac{c^\top\widehat{\boldsymbol{\beta}}-c^\top\boldsymbol{\beta}}{\sqrt{\widehat\sigma^2\,c^\top(\mathbb{X}^\top \mathbb{X})^{-1}c}}\sim t_{n-r}.\quad\square \]
The second point will be done in the tutorial. \(\square\)
📝 This result will be very important for building confidence regions and hypothesis testing.
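For instance, taking \(c\) to be a coordinate vector and \(\boldsymbol{\beta}=\boldsymbol{0}\) as the null hypothesis recovers the t values reported by summary.lm(). A check on the built-in cars data, with \(c=(0,1)^\top\) picking out the slope:

```r
# The reported t value equals (c' beta_hat) / (sigma_hat * sqrt(c'(X'X)^{-1}c))
# with c = (0, 1)' and beta = 0 under the null hypothesis.
fit <- lm(dist ~ speed, data = cars)
X   <- model.matrix(fit)
s   <- summary(fit)$sigma        # sigma_hat

cvec <- c(0, 1)                  # selects the slope coefficient
t_manual <- as.numeric(
  (cvec %*% coef(fit)) / (s * sqrt(t(cvec) %*% solve(t(X) %*% X) %*% cvec))
)

t_manual
summary(fit)$coefficients["speed", "t value"]   # same value
```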
Definition [\(\chi^2\) distribution]
Let \(Z_1,\ldots,Z_n\) be i.i.d. \(\mathcal{N}(0,1)\). Then \(\sum_{i=1}^nZ_i^2\sim \chi^2_n\) follows a \(\chi^2\)-distribution with \(n\) degrees of freedom.
Definition [Student \(t\) distribution]
Let \(Z\sim \mathcal{N}(0,1)\) and \(U \sim \chi^2_m\), with \(Z\) independent of \(U\). Then \[ \frac{Z}{\sqrt{U/m}}\sim t_{m} \] follows a Student \(t\)-distribution with \(m\) degrees of freedom.
Definition [Fisher \(\text{F}\) distribution]
Let \(U\sim \chi^2_p\) and \(V\sim \chi^2_q\), with \(U\) independent of \(V\). Then \[ F = \frac{U/p}{V/q} \sim \text{F}_{(p,q)} \]
follows a Fisher distribution with parameters \((p,q)\).
[Cochran’s Theorem]:
Let \(Z\) be a Gaussian random vector in \(\mathbb{R}^n\) with \(Z \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I}_n)\), where \(\boldsymbol{\mu} \in \mathbb{R}^n\) and \(\sigma > 0\). Let \(F_{1},\ldots ,F_{m}\) be subspaces of dimension \(d_i\), orthogonal to each other such that \(\mathbb {R} ^{n} = F_1 \oplus \dots \oplus F_m\). For \(i=1,\ldots,m\), let \(P_{F_i}\) denote the orthogonal projection matrix onto \(F_i\). Then:
The random vectors \(P_{F_{1}}Z,\ldots ,P_{F_{m}}Z\) have respective distributions \(\mathcal{N}(P_{F_{1}}\boldsymbol{\mu}, \sigma^2 P_{F_{1}}),\ldots ,\mathcal{N}(P_{F_{m}}\boldsymbol{\mu}, \sigma^2 P_{F_{m}})\).
The random vectors \(P_{F_{1}}Z,\ldots ,P_{F_{m}}Z\) are pairwise independent.
The random variables \(\frac{\|P_{F_{1}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2},\ldots ,\frac{\|P_{F_{m}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2}\) have respective distributions \(\chi^2(d_{1}),\ldots ,\chi^2(d_{m})\).
The random variables \(\frac{\|P_{F_{1}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2},\ldots ,\frac{\|P_{F_{m}}(Z-\boldsymbol{\mu})\|^2}{\sigma^2}\) are pairwise independent.
Proof. Will be done in the tutorial.\(\square\)